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Abstract 

We describe an implementation of compressible inviscid fluid solvers with block- 
structured adaptive mesh refinement on Graphics Processing Units using NVIDIA's 
CUDA. We show that a class of high resolution shock capturing schemes can be 
mapped naturally on this architecture. Using the method of lines approach with 
the second order total variation diminishing Runge-Kutta time integration scheme, 
piecewise linear reconstruction, and a Harten-Lax-van Leer Riemann solver, we 
achieve an overall speedup of approximately 10 times faster execution on one graph- 
ics card as compared to a single core on the host computer. We attain this speedup in 
uniform grid runs as well as in problems with deep AMR hierarchies. Our framework 
can readily be applied to more general systems of conservation laws and extended 
to higher order shock capturing schemes. This is shown directly by an implementa- 
tion of a magneto-hydrodynamic solver and comparing its performance to the pure 
hydrodynamic case. Finally, we also combined our CUDA parallel scheme with MPI 
to make the code run on GPU clusters. Close to ideal speedup is observed on up to 
four GPUs. 
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Magnetohydrodynamics and plasma 
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1 Introduction 



Graphics Processing Units (GPUs) are specialized for math-intensive highly 
parallel computation, thus more transistors are devoted to data processing 
rather than data caching and flow control like in CPU. So the potential tremen- 
dous performance of general non-graphics computations on GPUs has recently 
motivated a lot of research activities on general-purpose GPU (GPGPU) com- 
puting (see. e.g. Owens et al. 2007 for a review). 



Preprint submitted to Elsevier 



29 October 2009 



NVIDIA introduced the Tesla unified graphics and computing architecture in 
November 2006. The Tesla architecture is built around a scalable array of mul- 
tithreaded streaming multiprocessors (SM). A SM consists of eight streaming 
processor (SP) cores. The Tesla SM uses a new processor architecture called 
single-instruction, multiple-thread (SIMT). The SIMT unit creates, manages 
and executes up to 768 concurrent threads in hardware with zero scheduling 
overhead. The SM also implements barrier synchronization intrinsic with a 
single instruction. The fast barrier synchronization, together with lightweight 
thread creation and zero-overhead thread scheduling support very fine-grained 
parallelism allowing thousands and even millions of threads to be invoked in 
kernel calls to achieve highly scalable parallel programming. 



High performance supercomputing has been important in modern astrophys- 
ical research since it became available. Simulations allow astronomers to per- 
form "experiments" on astronomical objects, collide stars, galaxies, or model 
the entire visible Universe; all situations are clearly impossible to recreate in a 
terrestrial laboratory. Studying the formation of stars, black holes and galaxies 
in the Universe is particularly challenging computationally. Their formation 
involves the nonlinear interplay of a range of physical processes including grav- 
ity, turbulence, magnetic field, shocks, radiation, chemistry, etc. Those ques- 
tions motivated the astrophysical community to develop robust and efficient 
fluid codes with all the relevant physics. 



Studies involving astrophysical fluid dynamics in general are benefitting tremen- 
dously from using spatial and temporal adaptive mesh refinement (AMR). 
This is especially so in the studies of structure formation. For example, the 
radius of a star is 8 orders of magnitude smaller than the size of a molecular 
cloud. A uniform grid code is hopeless. On the other hand, the AMR technique 
has been demonstrated to work well in resolving the large dynamical range 
involved in those problems (e.g. Abel et al. 2002; Wang & Abel 2009). 



The mapping of computational fluid algorithms to GPU however is still at an 
early stage of development. Harris et al. (2003) performed cloud simulations 
using Stam's method (Stam 1999). This method is also used by Liu et al. 
(2004) for 3D flow calculations. Using finite difference methods, Brandvik 
& Pulla (2008) solved uniform grid 3D Euler equations, Elsen, LeGresley & 
Darve (2008) solved 3D Euler equations on a multi-block meshes and Zink 
(2008) solved Einstein's equation with uniform grid. As far as we are aware 
of, this work is the first on mapping an adaptive mesh finite volume solver to 
GPU. 
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2 CUDA 



In 2007, NVIDIA released CUDA for GPU computing as a language extension 
to C (NVIDIA 2009). CUDA makes GPU computing application development 
much easier and more efficient than earlier attempts to GPGPU using various 
shading languages which need to translate the computation to a graphics 
language. 

CUDA's parallelization model is based on abstraction of the GeForce 8-series 
hardware. It allows programmer to define kernels which can be executed in 
parallel by many threads on GPU. Threads are organized into ID, 2D or 3D 
thread blocks, where each block is executed on one SM. The SM maps the 
threads inside a block to the streaming processor (SP) cores and each thread 
executes independently with its own instruction address and register state. 
Synchronization is possible only within a block whereas global synchronization 
between blocks is impossible. Blocks are in turn organized into ID or 2D grid 
of blocks. Each thread can access its thread and block indices by two built-in 
variables threadldx and blockldx. 

The SM's SIMT unit creates, manages, schedules and executes threads in 
groups of 32 parallel threads called warps. The threads in a warp always 
execute a common instruction at a time, but different warps execute indepen- 
dently. As a result, different warps can execute on different branches. This is an 
enormous improvement for branching code compared to previous-generation 
GPUs as the 32-thread warps are much narrower than the SIMD (single- 
instruction multiple-data) width of prior GPUs. However, if threads of a warp 
diverge via a conditional branch, different execution path have to be serial- 
ized, increaing the total number of instructions executed for this warp. So 
branching inside a warp should still be minimized to achieve good efficiency. 

CUDA exposes the hardware memory hierarchy by allowing threads to access 
data from multiple memory spaces. All threads have access to the same global 
memory. Each thread block has a shared memory visible to all threads of 
the block and with the same lifetime as the block. Each thread has a private 
local memory and a set of registers. There are also two additional read-only 
memories accessible by all threads: the constant and texture memory. 

The shared memory is much smaller than global memory, typically 16 kB, 
but it is on-chip so it has very high register-level bandwidth. A typical pro- 
gramming pattern utilizing this fact is to stage data from global memory into 
shared memory, process the data there and then write the results back to 
global memory. 



3 



3 Block-structured adaptive mesh refinement 



In the Berger & Colella block-structured AMR (Berger & Colella 1989), a 
subgrid will be created in regions of its parent grid needing higher resolution. 
The hierarchy of grids is organized in a tree structure. Each grid is evolved as a 
separate initial boundary value problem. The whole grid hierarchy is evolved 
recursively. In this framework, a single grid is a natural unit to be sent to 
GPU for computing. Since the grids are dynamically created, arbitrary grid 
dimensions can arise in real applications. Thus the key issue for a paralleliza- 
tion scheme to couple efficiently to AMR is to let it work for arbitrary grid 
dimensions. Many parallelization models previously proposed for uniform grid 
GPU fluid solvers are thus not appropriate to be coupled with AMR. 

In this work, we use the implementation of Berger & Colella AMR in the 
publicly available hydrodynamics code Enzo (Enzo 2009). Previously we have 
implemented our own hydrodynamics and magnetodynamics code using the 
Enzo AMR framework (Wang et al. 2008; Wang & Abel 2009). Often, the hy- 
drodynamics/magnetohydrodynamics solver is the computationally dominant 
part (typically more than 10 times expensive than the AMR part in single 
core run). So in this work we consider the mapping of hydrodynamics solver 
onto GPU while leaving the creation and refinement of the grid hierarchy on 
CPU. 



4 Fluid solver 



The flows involved in star and galaxy formation are highly supersonic and the 
Reynolds number in those flows are very large (Spitzer 1978). Thus we are 
interested in solving the equations of compressible inviscid hydrodynamics. 
However, the parallelization scheme we implemented can be applied to any 
hyperbolic conservation laws. We will demonstrate this explicitly in section [7] 
by implementing a AMR magnetohydrodynamics (MHD) solver on GPU. 

The equations of compressible inviscid hydrodynamics can be written in the 
form of conservation laws, 

dU dF x dF y dF z 
dt dx dy dz 



The conserved variable U is given by 

U = (p, pv x , pv y , pv z , pE) T } (2) 
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where p is density, V{ are the three components of velocity for i = x,y, z, 
E = v 2 /2 + e is the total energy and e is the internal energy. 

The fluxes are given by 



pv x , pv 2 x + p, pv y v x , pv z v x , p{— + h)v x ] , (3) 



T 



,V 



2 



F y =[pVy } pV X Vy,pV y +P,pV Z Vy,p(— + h)Vy) , (4) 

,2 



T 



F z = ypv z , pv x v z , pv y v z , pv\ + p, p(y + h)v z J , (5) 

where h = e + p/p is the enthalpy. In this work we assume the ideal gas 
equation of state: 

P = (l ~ l)pe, (6) 



where 7 is the adiabatic index. 

Many numerical schemes have been developed to solve hyperbolic conservation 
laws of the form (pQ). We are interested in a class of finite volume methods 
called high-resolution shock- capturing (HRSC) schemes developed since the 
mid-1980s. Those schemes are designed to capture the correct shock speed even 
with very low resolutions (see e.g. LeVeque 2002 for a comprehensive review). 
This property makes it ideal for adaptive mesh fluid simulations where shocks 
outside the refined regions may not be well resolved. HRSC schemes represent 
the most popular scheme in modern astrophysical codes (e.g. Stone et al. 2008; 
Mignone et al. 2007; Wang et al. 2008). Thus we will focus on mapping the 
HRSC schemes onto GPU. 

We use the method of lines (MOL) to discretize the system flTJ spatially, 



iff pi t?x fpy py fpz _ fpz 

aU i,j,k _ _ r i+l/2,j,k r i-l/2,j,k _ 1 i,j+l/2,k 1 i,j-l/2,fe r i,j,k+l/2 r i,j,k-l/2 

dt ~ Ax Ay Az 

= HU), (7) 

where i,j,k refers to the discrete cell index in x,y,z directions, respectively. 
^i±i/2,i,fc! ^i,j±\/2,k an d Fi,j,k±i/2 are ^ ne h uxes a t the cell interface. 

As discussed by Shu & Osher (1988), if one uses a high order scheme to recon- 
struct flux spatially, one must also use the appropriate multi-level total vari- 
ation diminishing (TVD) Runge-Kutta schemes to integrate the ODE system 
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Fig. 1. The stencil for the calculation of flux F x _ 1 by equations (fT0|) to (fT2j) . 

1 2 

([7]). So instead of the forward Euler time integration in the original Berger- 
Colleia AMR (Berger & Colella 1989), we have implemented the second order 
TVD Runge-Kutta scheme: 

U {1) = U n + AtL(U n ), 

U n+1 = -IT + hj^ + ~AiL(?7«), (8) 

See Wang et al. (2008) for further details of the implementation of Runge- 
Kutta scheme in an AMR framework. 

Generally speaking, there are two classes of spatial reconstruction schemes 
(LeVeque 2002). One is reconstructing the unknown variables at the cell in- 
terfaces and then use exact or approximate Riemann solver to compute the 
fluxes. Another is direct flux reconstruction, in which we reconstruct the flux 
directly using the fluxes at cell center. We will adopt the first class in this 
work. But our framework can also be applied directly to the second class. 

For example, to calculate the flux F*_± at cell interface x { _i (see FigJT]), we 
need to reconstruct the left and right states u i _i l and %_i r . In this work, we 
use the piecewise linear method (PLM) (van Leer 1979) for reconstruction, in 
which the values of primitive variable u at Xi-2, Xj-i, Xi, Xi + i are needed. The 
formula for PLM reconstruction is 

u i-\,i = Ui-i + minmod - w;_ 2 )#, {u{ - Wi-i)0, {ui - Mj_ 2 )/2) , (9) 

r = u i + minmod - u i+1 )9, (u^ - Uj)0, - u i+l )/2) , (10) 

where < 9 < 2 and the minmod function is given as 

minmod(a, b, c) = 0.25(sign(a) + sign (6)) | sign (a) + sign(c)|min(|a|, |6|, |c|)(ll) 

With those two states, we use the Harten-Lax-van Leer (HLL) approximate 
Riemann solver (Harten et al. 1983) to calculate the flux F x _± by, 

1 2 

Fl, = 0.5(F x (u i _i l ) + F*( Ui _iJ - X ( Ui _ hr - Ui _ hl )), (12) 

where A = max (max(0, Xf, A+),max(0, — Ar, — A~)^ and Xf = v Xji ± c Sji for 
i = I, r. 
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The important property of this MOL approach is that the cells required for 
flux calculation in any direction are all along that direction. As we will see 
below, this property is crucial for a efficient parallelization scheme in 2D and 
3D. 



5 Mapping fluid solver to GPU 



5.1 One dimensional case 



In this section, we discuss a parallelization scheme for the ID case. In the next 
section we will extend this scheme to 2D and 3D. 

First, we note that one drawback of putting all the steps of the 2nd order 
Runge-Kutta scheme (jSJ) to GPU is that it needs to send both the old value 
and intermediate value to GPU in the second step. From our experience, data 
transfer between CPU and GPU is quite expensive so it needs to be min- 
imized. This would also double the GPU memory usage. However, the 2nd 
order Runge-Kutta scheme can also be written as 



U {1) = U n + AtL(U n ), 
U® = U® + AtL(uM), 

u n+l = hj n + hj^ 2) . (13) 

Thus, if we only put the first and second step to GPU, which is the com- 
putationally intensive part, and leave the third part on CPU, which is just 
the addition of two vectors, we may still get large speed-up. This will be the 
strategy we take. 

Since the first and second steps have exactly the same form, we only need 
to write one routine computing = U n + AtL{U n ) and call it twice. The 
pseudocode for this routine reads: 



allocate memory for primitives and fluxes on GPU 

copy primitives to GPU 

call kernel for flux computation 

call kernel for L(U) computation 

call kernel for time update U {1) = U n + AtL(U n ) 

copy [/« back to CPU 

free memory on GPU 



7 



PrifRjUmfn + 3] 
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i i i i-j i i 

Fig. 2. The parallelization scheme inside a single block of size n. Every thread, 
except the first and last, loads the data in the cell to the right of it. The first thread 
loads in two more cells to the left of it while the last thread loads in one more cell 
to the right of it. 

Since the computation of flux vector L(U) and time update use only local 
information and involve only one read and one computation, their implemen- 
tations are simple: just launch one thread for every active cell. 

The most computational intensive and tricky part is the flux computation, 
which happens at cell interfaces and needs a different treatment compared to 
the other two kernels. Our basic scheme is that every flux is calculated by a 
single thread (see Fig. |2J). So for a problem with N grid cells (including the 
ghost cells), the number of threads should be iV — 3. We take the number of 
threads per block to be a fixed number n = 64. Thus the number of blocks 
to be launched is int(iV/64) + 1. The final block will have some threads that 
do not correspond to flux computation, we add a conditional statement in the 
kernel to just let those threads do nothing. Since this affect only the last one 
or two warps, this remains efficient. 

For example, in the zth block, we want to calculate n fluxes at cell interfaces 
{xk-i/2, Xk+i/2, ...,x k+n -3/ 2 } with k = (i - l)n + 2. According to Eqs. ([TDD to 
(fT2l) . every thread needs 4 cells around it to calculate the flux. So in total, 
n + 3 cells {xk-2,Xk-i, ...,Xk+n} are needed for the calculation. 

As discussed in section |2J the memory bandwidth is much higher on the shared 
memory. So in the computation of a block, we will first load the n+3 cells from 
global memory to a shared memory array PrimLine[n+3]. After the computa- 
tion is finished, we will then write the flux back to global memory. As shown 
in Fig[5J our memory loading scheme goes as follows: thread % will read in 
primitive variables in cell Xk+i+2- To read in all the necessary data, the zeroth 
thread will then need to read in two more cells Xk and Xk+i and the (n-l)th 
thread will read in one more cell Xk+ n +2- 

The kernel code for this flux computation are: 

#define CUDA_BLOCK_SIZE 64 
#define NEQ_HYDRO 5 
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/ / get thread and block index 
const long tx = threadldx.x; 
const long bx = blockldx.x; 

int igrid = tx + bx*CUDA_BLOCK_SIZE; // array index in the data field 
int idx_prim, idx_priml; 

__shared__ float PrimLine[NEQ_HYDRO*(CUDA_BLOCK_SIZE+3)]; // input primitive variable 
__shared__ float FluxLine[NEQ_HYDRO*CUDA_BLOCK_SIZE]; // output flux 

if (igrid >= 2 && igrid <= size - 2) { // only do flux computation for active cells 

/ / load data from device to shared. 
idx_priml = (tx+2)*NEQ_HYDRO; 
PrimLine[idx_priml++] = Rho [igrid]; 
PrimLine[idx_priml++] = Eint [igrid]; 
PrimLine[idx_priml++] = Vx[igrid]; 
PrimLine[idx_priml++] = Vy[igrid]; 
PrimLine[idx_priml] = Vz [igrid]; 

//if the first, load in two more cells for boundary condition 
if (tx == 1 1 igrid == 2) { 
for (int i = -2; i <=-l; i++) { 

idx_prim = igrid + i; 

idx_priml = (i+tx+2)*NEQ_HYDRO; 

PrimLine[idx_priml++] = Rho[idx_prim]; 

PrimLine[idx_priml++] = Eint [idx_prim] ; 

PrimLine[idx_priml++] = Vx[idx_prim]; 

PrimLine[idx_priml++] = Vy[idx_prim]; 

PrimLine[idx_priml] = Vz [idx_prim] ; 

} 

} 

//if the last, load in one more cell for boundary condition 
if (tx == CUDA_BLOCK_SIZE - 1 1 1 igrid == size - 2) { 

idx_prim = igrid + 1; 

idx_priml = (tx+3)*NEQ_HYDRO; 

PrimLine[idx_priml++] = Rho [idx_prim] ; 

PrimLine[idx_priml++] = Eint [idx_prim] ; 

PrimLine[idx_priml++] = Vx[idx_prim]; 

PrimLine[idx_priml++] = Vy[idx_prim]; 

PrimLine[idx_priml] = Vz[idx_prim]; 

} 

} 
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Fig. 3. This figure illustrates how the 2D grid with active (white) and ghost (shaded) 
cells are mapped to ID grids in x and y sweeps. 

/ / synchronize to ensure all the data are loaded 
__syncthreads(); 

if (igrid >= 2 && igrid <= size - 2) { 
// the main computation: calculating the flux at tx 
LLF_PLM(PrimLine, FluxLine, tx); 

/ / copy ID Flux back to Flux 
idx_priml = tx*NEQ_HYDRO; 
FluxD [igrid] = FluxLine [idx_priml++]; 
FluxSl [igrid] = FluxLine [idx_priml++]; 
FluxS2 [igrid] = FluxLine [idx_priml++]; 
FluxS3[igrid] = FluxLine[idx_priml++]; 
FluxTau [igrid] = FluxLine [idx_priml ]; 

} 

5.2 Two and three dimensional case 

In higher dimension, the main trick is that in the MOL, 2D and 3D flux com- 
putation can be reduced to ID problems and thus the parallelization scheme 
discussed above for ID problem can be applied directly to 2D and 3D prob- 
lems. 

For example, in 2D application of MOL, one sweeps through a ID grid of 
lines extending in the other direction, calculating the fluxes on every line in 
every sweep. To compute flux in all two directions, one does two sweeps. The 
trick is that in every sweep, we conceptually regard the 2D grid to be a large 
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Table 1 

Technical specifications of NVIDIA's Quadro FX 5600 graphics card. 



Number of streaming multiprocessors (SM) 16 

Number of streaming processors per SM 8 

Warp size 32 

Parallel data cache 16 kB 

Number of banks in parallel data cache 16 

Number of 32-bit registers per SM 8192 

Clock frequency of each SM 1.35 GHz 

Frame buffer memory type GDDR3 

Frame buffer interface width 384 bits 

Frame buffer size 1.5 GB 

Constants memory size 64 kB 

Clock frequency of the board 800 MHz 

Host bus interface PCI Express 



ID grid, including the ghost cells, which is continuous in the direction we are 
sweeping. Thus, when we apply thread decomposition to the grid, we send the 
large ID array to it and use exactly the same scheme discussed in previous 
section for ID problems. This is illustrated in Fig. [3] for a 2D problem, where 
we show the original 2D grid and the ID grid sent to the kernel for x and y 
sweeps. The 3D case is exactly the same. The additional cost of this method 
is that we will also calculate fluxes for all the ghost cells since now we regard 
them as normal cells so some calculation is wasted. However, this is only a 
small cost. But the gain is that the parallelization scheme now works for any 
grid dimensions. One only need to use two "if statements to handle the first 
and last points in the grid, which will lead to execution branching only in the 
first and last two warps. Note that in y sweeping, when reading from global to 
shared memory, the reading is from non-continuous locations. Thus reading is 
non-coalesced in those cases. We have also experimented reshuffling the large 
3D array in CPU so that the reading can be coalesced but we found this leads 
to lower performance because of the reshuffling cost on CPU and additional 
data transfer. 
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Fig. 4. Density profile at t = 0.14 for the Sod test problem with N = 100. Solid line 
is the CPU solution and squares are the GPU solution. 

6 Results 

All the problems in this work are run on a Quadro FX 5600 card. For reader's 
convenience, the technical specifications of Quadro FX 5600 card are listed in 
Table 1. The corresponding CPU comparison cases are run on a single 3 GHz 
core. 



6.1 ID Sod problem 

To evaluate the accuracy of the GPU solution, we first run a ID Sod shock 
tube problem (Sod 1978) with both the CPU code and GPU code. The initial 
condition for this problem is two uniform states separated at x = 0.5. The left 
and right density and pressure are pi = l,pi = 1 and p r = 0.125,p r = 0.1. The 
initial velocity is zero everywhere and the adiabatic index is 7 = 1.4. We use 
100 grid points for this test. The result is shown in FigHJ It can be seen that 
the GPU solution agrees with the CPU solution very well. This validates our 
GPU implementation of fluid solver. 

6.2 3D Sedov-Taylor blast wave 

Next, we study the performance of GPU code in 3D using the Sedov-Taylor 
blast wave problem (Sedov 1959). This section will use a uniform grid. An 
AMR example is given in the next section. 

To set up the problem, we deposit a total energy E = 1 into a spherical region 
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Fig. 5. Running time of a single timestep for CPU code (asterisk) and GPU 
code (plus) for 3D uniform grid Sedov- Taylor blast wave problem with grid sizes 
N = 32 3 , 64 3 , 128 3 , 200 3 , 256 3 . The diamonds show the ratio of the CPU and GPU 
running time. 

of radius r = 0.1 (the "explosion region") at the center of the simulation 
box which has length 1. The pressure inside the explosion region can then 
be calculated as p = 3(7 — 1)E j '(47^) , where 7 = 1.4 for this problem. The 
initial density is uniform throughout the box with p — 1 and the pressure is 
set to a small value p = 1CU 5 outside the explosion region. 

Fig. [5] shows the performance comparison of the CPU and GPU code for 
uniform grid sizes N = 32 3 , 64 3 , 128 3 , 200 3 , 256 3 . It can be seen that we get 
fairly uniform 8 — 10 times speed-up for two orders of magnitude difference in 
grid sizes. This is very encouraging as a wide variety of grid sizes can arise in 
real AMR applications and a uniform speed-up for a large range of grid sizes 
is a necessary condition for good performance in AMR applications. 



6.3 3D cloud disruption with AMR 

In this section we show a case of cloud disruption by blast wave, demonstrating 
that our implementation also lead to good speed-up on AMR applications. We 
put a uniform density cloud with radius 0.01 at distance 0.15 away from the 
center of the blast wave. The cloud is 10 times denser than the medium and 
is in pressure equilibrium with the medium. The topgrid has resolution 64 3 
and we use six levels of refinement to resolve the cloud disruption, which 
correspond to an effective uniform resolution 4096 3 . 
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Fig. 6. Density slice showing the disruption of a cloud by a Sedov- Taylor blast wave. 
The topgrid resolution is 64 3 . Six levels of refinement is used to resolve the cloud 
disruption. The dotted lines show the boundary of subgrids. 

For this problem, the running time of a single timestep in GPU case is ~ 8 
times faster than the CPU case. This is consistent with the behavior we saw 
in last section for uniform grid tests with various grid sizes. 



7 Magnetohydrodynamics 

Magnetic field has been known to play a very important role in the formation 
of stars (Shu et al. 1987) and its role in galaxy formation has also been in- 
vestigated recently (Wang & Abel 2009). Thus for astrophysical applications, 
a MHD solver is of great interest. Our parallelization scheme discussed above 
can be easily extended to the case of MHD. In this section, after discussing 
the MHD equations, we will show some results of this GPU MHD solver. 

7. 1 Equations 

MHD equations can also be written in the form of conservation laws. Thus any 
schemes for hydrodynamics in principle can also be applied to MHD. The main 
numerical problem of solving MHD equations is cleaning up the numerically 
generated magnetic monopoles as magnetic field is divergence-free physically. 
Various schemes have been proposed for this purpose and there is still no 
universal agreement which scheme is the best (see Toth 2000 for a comparison 
of various schemes). In this work, we will adopt the so-called hyperbolic clean 
approach (Dedner et al. 2002), which nicely fit in our developed framework. 
This may not be the best scheme for all applications. But other schemes, 
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like the projection scheme, requires the solution of a Poisson equation. Thus 
mapping them to GPU requires additional work on sparse matrix solvers. 

Following Dedner et al. (2002), we consider the generalized Lagrange multi- 
plier (GLM) formulation of the MHD equations, which can be written in the 
conservative form (JT]) with the conserved variables given by 

U = (p, pv x , pv y , pv z , pE + B 2 /2, B x , B y , B z , ^) T , (14) 

where Bi with i = x,y,z are the three components of magnetic fields and 
ip is the additional scalar field introduced in the GLM formulation for the 
divergence cleaning. 

And the fluxes are given by 

F x = (pv x , pv 2 x +p + B 2 /2 - B 2 X , pv y v x - B y B x , 

v 2 

pv z v x - B Z B X , p(— + h)v x + B 2 v x - B X B ■ v, 

V>, v x B y - v y B x , -v z B x + v x B z , c 2 h B x ) T , (15) 

F V = ( P Vy, P V X Vy - B x By, fW^ + J> + ^ / ^ ~ B J , 

v 2 

pv z v y - B z B y , p{— + h)v y + B 2 v y - ByB ■ v, 

VyB Z - V Z By, 4>, ~V X By + VyB X , (^By) T , (16) 

F z = (pv z , pv x v z - B X B Z , pv y v z - B y B z , pv 2 + p + B 2 /2 - B 2 , 
v 2 

P(~2 + h ) v * + B2v * ~ B * B ■ u ' 

-v y B z + v z B y , v z B x - v x B z , i/;, c 2 h B z ) T , (17) 

where Ch is a constant controlling the propagation speed and damping rate of 
V ■ B (Dedner et al. 2002). 

Solving the GLM-MHD system in our framework is straightforward. All we 
need to do is to add the additional primitive variables to our hydrodynamics 
solver. 

7.2 ID Brio- Wu problem 

First, to evaluate the accuracy of the GPU solution, we run a ID Brio-Wu 
shock tube problem (Brio & Wu 1988) with both the CPU code and GPU 
code. The setup of Brio-Wu problem is similar to the Sod problem discussed 
in section [67T| with two uniform states separated at x = 0.5. But here there is 
non-zero magnetic field in the initial condition. The left and right states are 
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Fig. 7. Density profile at t = 0.1 for the Brio-Wu test problem with N = 200. Solid 
line is the CPU solution and squares are the GPU solution. 
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Fig. 8. Running time of MHD code on a single timestep for CPU code (asterisk) 
and GPU code (plus) for 3D uniform grid Sedov- Taylor blast wave problem with 
grid sizes N = 32 3 ,64 3 , 128 3 ,200 3 . The diamonds are the ratio of the two running 
times. 



Pl = 1,Pl = 1,v x l = 0,v y L = 0,B xL = 0.75, B yL = 1 and p R = 0.125,p L = 
0.1, v x l = 0,v y L = 0,B xL = 0.75, ByR = —1. The adiabatic index is taken to 
be T = 2. We use 200 grid points for this test. The result is shown in FigJTl 
It can be seen that the GPU solution agrees with the CPU solution very well. 
This validates our GPU implementation of MHD solver. 
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Fig. 9. MHD turbulence simulation: density iso-surfaces with magnetic field lines at 
one dynamical time. 

7.3 Sedov- Taylor Blast Wave 

To compare the performance of the MHD solver to the Hydrodynamics solver, 
we apply the MHD solver to the Sedov- Taylor blast wave problem discussed 
in section [6T2l Fig. [8] shows the performance comparison of the CPU and GPU 
MHD solvers for uniform grid sizes N = 32 3 , 64 3 , 128 3 , 200 3 . Note that for 
the MHD case, the memory requirement for a 256 3 run is 1.81 GB (three- 
primitives, flux and L(C/)-copies of the nine MHD fields in a grid size 256 3 ). 
Thus it cannot fit in the 1.5 GB memory of a Quadro FX 5600 card. 

From FigJB]we can see that in the MHD case we also get 8 — 10 speedup for 
a large range of grid sizes. This demonstrates the efficiency of our scheme as 
applied to more complicated physical problems. 

74 MHD turbulence 

Finally, we present the results of an application of our GPU MHD solver to a 
problem that is of great astrophysical interest related to star formation: the 
decay of MHD turbulence (Stone et al. 1998; Mac Low et al. 1998). Following 
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Fig. 10. Time evolution of averaged Mach number (a) and total magnetic energy 
(b). 

Stone et al. (1998), we set initial density to be uniform. An isotropic turbulent 
velocity with Burger-like power spectrum P(k) oc k~ 2 is imposed. The initial 
Mach number is M. = 10. We use an isothermal equation of state by setting 
7 = 1.001. The resolution is taken to be uniform 128 3 . Fig. [9]shows the results 
of density field after one dynamical time. The code takes about 10 minutes 
to evolve to this point. Fig. [TD] shows the time evolution of the average Mach 
number and total magnetic energy. Consistent with the findings in Stone et 
al. (1998), a significant fraction of the kinetic energy decays in one dynamical 
time, and the presence of magnetic field cannot delay it. Those results are 
not new. However, the much shorter running time of the GPU code makes it 
possible to do many runs in a reasonable time to build up statistics, which 
is crucial for some aspects of star formation such as the core mass function. 
Currently, this has only been possible in 2D (e.g. Basu & Ciolek 2009). 



8 GPU cluster computing 

To be competitive with current state of the art simulations which are usually 
run on large CPU clusters, a GPU implementation should also be MPI parallel. 
A simple setup would be building a small CPU cluster with each CPU node 
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Fig. 11. Speedup of the MPI-CUDA hybrid code as a function of the number of 
GPUs for uniform 128 3 resolution Sedov- Taylor blast wave simulations. 

containing one or a few GPU cards. As a simple experiment, we built a 4 nodes 
CPU cluster with 4 Mac Pro workstations and a NVIDIA GeForce 8800 GT 
graphics card on each node. The CPU nodes communicate with each other 
using the standard MPI library over Gigabit ethernet. 

As discussed above, in our scheme, a grid is the basic unit of GPU computing. 
This makes it trivial to combine our scheme with the standard MPI parallel 
scheme of Berger-Collela AMR, which also use grids as the basic unit for 
distribution among processors. The code first calls MPI library to distribute 
grids to different processors. Then each processor sends its grids, one by one, 
to its GPU for computing. We have tested the resulting MPI-CUDA hybrid 
code on up to 4 GPUs. The results of the Sedov- Taylor blast wave simulations 
with a uniform 128 3 resolution using 1, 2 and 4 GPUs are shown in Fig. [TTJ It 
shows that our code archived very good, close to ideal speedup for up to four 
GPUs. Thus it seems realistic that large GPU clusters may offer significant 
cost saving as compared to a CPU clusters giving the same performance the 
magneto hydrodynamical problems our implementation can study. 



9 Discussions and future directions 



In this work we described how to map HRSC schemes for hyperbolic con- 
servation laws to GPU using NVIDIA's CUDA. We demonstrated that our 
framework as applied to the equations of inviscid compressible hydrodynam- 
ics and MHD can lead to a significant speedup. Specifically, on a Quadro FX 
5600 card, saw approximately a factor of ten speedup compared to a single 3 
GHz CPU core. 
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An important purpose of our GPU parallel scheme design is achieving good 
scalability. In typical fluid simulations with as many as millions of cells per 
grid, our parallelization scheme will launch millions of threads at the same time 
on GPU to perform the computation. This exceeds the Quadro FX 5600's 
ability of running 12288 threads concurrently by more than two orders of 
magnitude. Thus we expect that the speedup factor of our parallelization 
scheme will increase linearly with the number of SMs on the GPU. This makes 
our scheme highly scalable to future generation of graphics hardwares. 

One important topic we would like to concentrate on in the near future is 
sparse matrix solvers for Poisson equation. If one can also speed up Poisson 
solvers by a similar factor on GPU, then coupled with the fluid solvers we 
implemented in this work, a whole range of astrophysical simulations will be 
open to processing on the GPU as we can then model astrophysical fluids with 
self-gravity, viscosity and other non-ideal effects. An implementation of Pois- 
son solver will also make it possible to use projection method for divergence 
cleaning and implicit fluid solvers. 
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